# Vision-Text Alignment
## Ovis2-1B-dev
Isotr0py · Apache-2.0 · Text-to-Image · Transformers · Multilingual
Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structural alignment of vision and text embeddings, and it pairs strong performance at small model scale with enhanced reasoning, video and multi-image processing, and improved multilingual OCR.
Downloads: 79 · Likes: 1
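As a loading sketch only: the snippet below assumes the repo id `Isotr0py/Ovis2-1B-dev` for this card and relies on `trust_remote_code`, since Ovis models ship their own multimodal classes; the exact preprocessing and generation interface is defined by the model's custom code and is not shown here.

```python
# Minimal sketch: loading an Ovis2-style MLLM via Transformers.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Isotr0py/Ovis2-1B-dev",     # assumed repo id for this card
    torch_dtype=torch.bfloat16,  # half precision keeps the 1B model light
    trust_remote_code=True,      # Ovis defines custom multimodal classes
)
model.eval()
```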
## siglip-so400m-14-980-flash-attn2-navit
HuggingFaceM4 · Apache-2.0 · Text-to-Image · Transformers
A SigLIP-based vision model that raises the maximum input resolution to 980x980 through interpolated positional embeddings and implements the NaViT strategy for variable-resolution, aspect-ratio-preserving image processing.
Downloads: 4,153 · Likes: 46
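The resolution bump this card describes comes from resizing the learned position grid. Below is a minimal sketch of that interpolation; the grid sizes and embedding width are illustrative assumptions, and SigLIP-style encoders have no class token, so the whole sequence is treated as a grid.

```python
# Sketch of positional-embedding interpolation: resizing a ViT's learned
# position grid so a model pretrained at a lower resolution accepts 980x980.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor,
                          old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize (1, old_grid**2, dim) position embeddings to (1, new_grid**2, dim)."""
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so 2D bicubic interpolation applies
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Illustrative: patch size 14, 27x27 grid (~384px input) -> 70x70 grid (980px)
pos = torch.randn(1, 27 * 27, 1152)
print(interpolate_pos_embed(pos, 27, 70).shape)  # torch.Size([1, 4900, 1152])
```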
## chinese-clip-vit-large-patch14
OFA-Sys · Image Classification · Transformers
A Chinese CLIP model based on the ViT architecture, supporting Chinese vision-language tasks.
Downloads: 2,333 · Likes: 32
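A usage sketch for this card with the ChineseCLIPModel and ChineseCLIPProcessor classes from Transformers; the image path and candidate captions are placeholders.

```python
# Image-text matching with Chinese-CLIP; image and captions are placeholders.
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

repo = "OFA-Sys/chinese-clip-vit-large-patch14"
model = ChineseCLIPModel.from_pretrained(repo)
processor = ChineseCLIPProcessor.from_pretrained(repo)

image = Image.open("example.jpg")  # placeholder image
texts = ["一只猫", "一只狗"]  # "a cat", "a dog"

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(probs)  # match probability of the image against each caption
```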
## vit_base_patch16_clip_224.openai
timm · Apache-2.0 · Text-to-Image · Transformers
CLIP is a vision-language model developed by OpenAI that trains paired image and text encoders with a contrastive objective, which enables zero-shot image classification.
Downloads: 618.17k · Likes: 7
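timm ships only the image tower of this checkpoint, so the sketch below covers embedding extraction; full zero-shot classification would pair it with CLIP's text encoder (e.g. via open_clip). The dummy tensor stands in for a preprocessed image batch.

```python
# Extracting image embeddings from the timm CLIP image tower above.
import timm
import torch

model = timm.create_model("vit_base_patch16_clip_224.openai",
                          pretrained=True, num_classes=0)  # 0 -> pooled features
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy stand-in for a preprocessed image
with torch.no_grad():
    emb = model(x)
print(emb.shape)  # (1, 768) for the ViT-B/16 tower
```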
## clip-vit-base-patch16
openai · Image-to-Text
CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification.
Downloads: 4.6M · Likes: 119
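A minimal zero-shot classification sketch with this checkpoint via Transformers; the image path and label prompts are placeholders.

```python
# Zero-shot image classification with CLIP; image and labels are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

repo = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(repo)
processor = CLIPProcessor.from_pretrained(repo)

image = Image.open("example.jpg")  # placeholder image
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```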